
Poster: Nouma Date: Jul 21, 2012 9:32am
Forum: general Subject: Website disappears from archive after adding robots.txt?

Hello everyone,

I would like to ask: does the Internet Archive delete older copies of an archived website from its archives after a new version of the website adds a robots.txt file that disallows crawling?

In other words, if the new version of some website contains a robots.txt file, but the older version (from the previous owner of that domain) didn't contain this file, would the Internet Archive also delete the older copies?

I know of several websites that were "deleted" from the Internet Archive this way. I cannot open those archived websites anymore, because some kind of robots.txt error is shown, but I was able to open these websites from the archive a few years ago, before the new owner uploaded this robots.txt file.

I really need to view those websites once again. Is it even possible?

These are the websites I have in mind:
www.arcanum1.com
www.troikagames.com

Thank you for any reply.
Have a nice day.
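(For context, the kind of robots.txt file being described here is very small. A hypothetical new domain owner who wanted to block all crawlers from the whole site would only need something like:

```
User-agent: *
Disallow: /
```

The `*` line applies the rule to every crawler, and `Disallow: /` excludes every path on the site.)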


Poster: PiRSquared Date: Oct 6, 2014 12:53pm
Forum: general Subject: Re: Website disappears from archive after adding robots.txt?

Are these the same by any chance?
https://web.archive.org/*/http://arcanum1.game-alive.com/
http://www.terra-arcanum.com/sierra/


Poster: Xrobinson Date: Oct 6, 2014 12:16pm
Forum: general Subject: Re: Website disappears from archive after adding robots.txt?

I brought this up today in another topic. Now I doubt I will ever get an answer.

It seems that yes, the past can automatically be removed from the archive by a robots.txt file.


Poster: user001 Date: Oct 6, 2014 12:37pm
Forum: general Subject: Re: Website disappears from archive after adding robots.txt?

Your answer is in the FAQs:
https://archive.org/about/faqs.php

The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are disallowed from a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with robots.txt is strictly voluntary. In fact, most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey the instructions anyway.

However, Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides he/she prefers not to have a web crawler visiting his/her files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt (you will see a "robots.txt query exclusion error" message).

Sometimes a web site owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a "blocked site error" message, that means that a site owner has made such a request and it has been honored.

Currently there is no way to exclude only a portion of a site, or to exclude archiving a site for a particular time period only.

When a URL has been excluded at direct owner request from being archived, that exclusion is retroactive and permanent.
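(As a sketch of the mechanism the FAQ describes: a site-wide `Disallow: /` rule blocks every crawler that honors robots.txt from every path. The robots.txt content and the `ia_archiver` user-agent string below are illustrative assumptions, not taken from any particular site; the check itself uses Python's standard `urllib.robotparser`.)

```python
from urllib import robotparser

# Hypothetical robots.txt uploaded by a domain's new owner:
# it disallows all crawlers from the entire site.
robots_txt = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A robots.txt-respecting crawler (e.g. one identifying as "ia_archiver")
# would be refused every URL on the site.
print(rp.can_fetch("ia_archiver", "http://www.example.com/anything"))  # False
```

Since the Wayback Machine applied such rules retroactively at the time, a file like this on the *current* site was enough to hide *all* previously archived copies, including those made under earlier owners.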